Abstract: We explore the hypothesis that it is possible to obtain
information about the dynamics of a blog network by analysing 
the temporal relationships between blogs at a semantic level, 
and that this type of analysis adds to the knowledge that can be 
extracted by studying the network only at the structural level 
of URL links. We present an algorithm to automatically detect 
fine-grained discussion topics, characterized by n-grams and time 
intervals. We then propose a probabilistic model to estimate the 
temporal relationships that blogs have with one another. We 
define the precursor score of blog A in relation to blog B as 
the probability that A enters a new topic before B, discounting 
the effect created by asymmetric posting rates. Network-level 
metrics of precursor and laggard behavior are derived from these 
dyadic precursor score estimations. This model is used to analyze 
a network of French political blogs. The scores are compared 
to traditional link degree metrics. We obtain insights into the 
dynamics of topic participation on this network, as well as the 
relationship between precursor/laggard and linking behaviors. 
We validate and analyze results with the help of an expert on 
the French blogosphere. Finally, we propose possible applications 
to the improvement of search engine ranking algorithms. 

I. Introduction 

For cultural anthropologists, understanding fads, trends, or,
more generally, cultural similarity essentially comes down to
explaining "the capacity of some representations to propagate until
becoming precisely cultural, that is, revealing the reasons
of their contagiosity" [1]. This type of research programme
admittedly assumes the possibility of, on one hand, describing 
representations in a consistent manner, and, on the other 
hand, apprehending processes of social mediation. Defining 
consistent cultural items is indeed crucial to describe adop- 
tion of similar ideas, behaviors, opinions, topics, etc. — the 
literature proposes here a large variety of concepts, such as 
using same bags of terms, having identical opinion vectors, 
duplicating references (for instance to digital content such as 
online video or news articles, tagged by the same URL) or, 
more loosely, being "infected" by spreading "memes". Second, 
describing social mediation requires understanding jointly
how some types of social network configurations and some 
types of interactions may or may not favor the transmission, 
reproduction or adoption of behaviors, ideas, etc. Again, a vast 
amount of research has been concerned with normative models 
or descriptive protocols aimed at understanding which kind of
individuals were more or less likely to pass on some pieces of
information, and which type of network positions could favor 
the diffusion of some items. 

By relying on large-scale datasets on which individuals 
talk about what and when, specifically in online communi- 
ties, social computing has recently contributed to this broad 
research programme by intensively developing two pragmatic 
streams of study: detection of "topics", and characterization of 
"informational cascades". Studies focused on topic detection 
explore bursts and regularities of behavior or term use (e.g.,
[2]), sometimes in order to infer trends in the general population
[3], [4]. In all these studies, cultural representations are assumed
to be extremely atomic, i.e. based on a single behavior (a vote), 
item (a reference, a URL), apprehending cultural contagion 
pretty much similarly to disease contagion — to the notable 
exception of [5] who gather similar sentences into clusters 
of quotes, getting closer to the polymorphism of cultural 
representations emphasized by anthropologists. 

On the other hand, studies on informational cascades
currently adopt a structural stance, migrating from the
"two-step-model" to more recent arguments underlining the
importance of more horizontal, less hierarchical patterns [6], [7].
Importantly, in this perspective, information flows and diffusion
paths are characterized along a given social network, available 
a priori. In many cases however, and certainly in blogs 
in particular, much of the information regarding the whole 
underlying interaction infrastructure is simply missing (be it in 
terms of news media readership, email exchanges and broadly 
any type of non-blog-based online conversation, phone calls, 
etc.). 

In this paper, we aim at bridging these rather separate 
streams by adopting (i) a looser view on representations, as 
stories or cultural attractors [8], [9] rather than atomic items
and, (ii) by considering information sources, in our case 
bloggers, as sensors in a social system - in particular as 
representatives of topics discussed in the society - so as 
to suggest possible/implicit information diffusion flows or, 
at least, precedence relationships. As an aside, the current 
contribution also considers observed social networks as effects 
rather than just causes of information diffusion. 

We thus propose to identify topic classes, exhibit temporal
precedence relations between sources based on significant
plausibility for an individual to address a topic before others
do, and finally compare this structure with the partial
network of interactions constituted by explicit links among
bloggers. Classical authority measures are found to have only
a weak correlation with our approach, which rather exhibits
potential online whistleblowers. The next section presents
an overview of the relevant literature, while Sec. III details
the empirical protocol used to identify topics. Sec. IV then
describes our approach to compute probable precedence
relationships; results are discussed and reframed in Sec. V.

II. Related work 

A. Temporal detection of topics/bursts. 

Topic characterization from (online) text corpora generally 
relies on terms, n-grams (i.e. a basic linguistic unit of n 
terms) or sentence segments. Once basic text units have been 
defined and extracted, topics are appraised both quantitatively 
and temporally, essentially by describing "how much on 
which period of time they are being discussed". This led to 
distinguishing bursts of interest ("spikes") [2], as opposed
to continuous discussions ("chatters") around topics [10].
Models of the temporal [11] or spatial [12] regularities in
the usage of topics have been subsequently developed, up
to inferring and predicting accurate information regarding the
whole population behavior.

Another stream of research has focused on improving the 
qualification of topics: for instance, by detecting whether 
issues are addressed in a positive light or not (the so-called
field of "sentiment analysis", see [13] among others); or, closer
to our issues, by managing to group portions of text into
classes of similar content [5], thereby implicitly addressing
one common critique among social scientists regarding the 
atomism of "memes" as cultural items. 

B. Precedence and influence 

Empirical studies of influence generally rely on interaction 
networks, using relational information to characterize conta- 
gion paths, and following a long tradition in mathematical 
sociology of social network-based models of information 
diffusion. As regards blogspace in particular, after initial 
descriptions of the underlying social network structure (e.g.,
[14], who also discuss bursty behavior in link creation), [15]
has been one of the first studies to specifically focus on the
structure of link cascades. In a previous study, [16] describe
more precisely local influence patterns such as the relationship 
between e.g. holistic patterns and the weakness of links, in 
Granovetter's sense. [17], on the other hand, use various social 
network structures to show that possible influence of a given 
blog is best described by strictly structural page-rank-style 
measures. 

Since influence is obviously related to precedence
relationships, several papers focus rather on temporal behavioral
precedence. For instance, the authors of [18] exhibit explicit
temporal dependencies on an email transmission network by
characterizing possible shortcuts in information paths, because
a dyad (A,B) could communicate less quickly than (A,C) and
(C,B) separately do.

In terms of intertwining social network structure and
precedence/influence, the relationship between topology and
precursors or laggards had also been explored in [19], but with
the assumption that the social network is known a priori, and 
by monitoring the adoption of a unique yes-or-no behavior. As 
said before, it is likely that a lot of information about the social 
structure is missing in most of the above studies, which con- 
sider the (given) social network as the substrate of information 
propagation. By assuming that the social structure describes 
only a non-significant fraction of all possible interaction links 
and contagion paths in the context of (for instance) political 
discussions, we basically wish to suggest that, here, the social 
network could just be a secondary material in the study of 
contagion. 

Some studies do exactly so and exhibit influence
relationships from usage information only: for instance in [20] a
Markov Chain Model is used to characterize which topics are
most likely to transition into others, using data extracted from
scientific bibliographic databases. Back to blogs, "probable"
content diffusion paths could be exhibited in [21] by using
classifiers based upon blog features: for instance, having
similar citing and content posting patterns; however, the
analysis does not seem to make use of topic dynamics per se.
Another reference [22] introduces an analysis which integrates 
more semantics, essentially in order to design automatic feed 
recommenders — which appears nonetheless to be still based 
on structural features (in-degree statistics) even if a filter is 
applied over general topics (politics vs. IT, etc.). 

On the whole, and in the context of partial social network in- 
formation, the issue of the detection of implicit, non-structural 
influence flows using temporal precedence in addressing topics 
remains a pending question. 

III. Unit of activity detection 

We are interested in identifying topics of discussion for 
which we can later analyse the temporal relationships of their 
participants. Such topics must have two characteristics to be 
relevant to our analysis: to have well defined time boundaries 
within our observation period and to be maintained by the 
participation of several blogs. If these two constraints are
respected then we are observing what we will call a well
defined "unit of activity". We empirically define a method that
identifies bursty topics which meet these constraints.

In [3], research related to the problem of topic detection
is classified into two main categories: probabilistic models to
identify long-range trends in general topics and the use of
rare named entities to study short information cascades. We
are not interested in long-range, general topics, nor in having
to rely on the occurrence of very specific, rare strings. Instead,
our goal is to identify topics that can be identified by a set of
n-grams and a well bounded period of time, and that represent
simple, self-contained units of activity.

We propose a rather holistic approach that takes advantage
of both the textual content of blog posts and the times at
which these posts were published.

The process of topic detection we propose consists of 
a classical sequence of treatments that we perform on our 
dataset: 

1) Part-of-speech tagging and lemmatisation of each post's
title and content in order to enumerate all relevant
n-grams in the corpus.

2) Detection and filtering of n-gram temporal bursts. 

3) Merging of redundant n-gram bursts into unique topics. 

A. Linguistic treatment

We perform the first step using the TreeTagger tool [23].
In this step we generate a new version of each post's title and
textual content, where each word is lemmatised and augmented
with a part-of-speech tag.

We then divide the corpus of text generated by the previous
step into chunks, delimited by punctuation marks. Afterwards,
we find all the n-grams that occur in these chunks. This search
is constrained by a set of rules, so as not to generate an
intractable number of n-grams and to explore only cases we
believe are likely to lead to meaningful topics. The rules are
the following:

• N-grams must have two or more words. 

• An n-gram must contain at least one noun.

• All words that are not nouns, verbs, adjectives or numbers 
are discarded. 

• All n-grams that contain words in a special set called 
stop-words list are rejected. 

These rules are empirical, having been obtained by experimen- 
tation with real datasets. The word set in the last rule contains 
words that have a strong temporal meaning, and that would 
later on lead to the detection of meaningless temporal bursts 
of usage. We used a set containing names of months, days of 
the week and holiday seasons (like Christmas), in both French 
and English. 
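The extraction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coarse tag names (`NOUN`, `VERB`, `ADJ`, `NUM`, `PUNCT`), the maximum n-gram length `max_n`, and the tiny stop-word set are all hypothetical stand-ins, not the tagger's actual tagset or the list used in the study.

```python
# Assumed coarse part-of-speech labels; a real tagger's tagset will differ.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "NUM"}
# Illustrative subset of temporal stop-words (months, weekdays, holidays).
STOP_WORDS = {"janvier", "lundi", "noel", "january", "monday", "christmas"}


def chunks(tokens):
    """Split (lemma, tag) pairs at punctuation, keeping only content words."""
    current = []
    for lemma, tag in tokens:
        if tag == "PUNCT":
            if current:
                yield current
            current = []
        elif tag in CONTENT_TAGS:
            current.append((lemma, tag))
    if current:
        yield current


def extract_ngrams(tokens, max_n=5):
    """Return the set of n-grams (n >= 2) that satisfy the four rules."""
    found = set()
    for chunk in chunks(tokens):
        for n in range(2, max_n + 1):
            for i in range(len(chunk) - n + 1):
                window = chunk[i:i + n]
                words = tuple(word for word, _ in window)
                if not any(tag == "NOUN" for _, tag in window):
                    continue  # rule: at least one noun
                if any(word in STOP_WORDS for word in words):
                    continue  # rule: no temporal stop-words
                found.add(words)
    return found
```

Discarding function words before forming n-grams is what lets an n-gram such as "apporter contribution debat" span words that were not adjacent in the original sentence.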

B. Temporal bursts detection 

In the second phase, we analyse the pattern of occurrence 
of each n-gram, dividing the period of observation into bursts 
of activity. For this purpose, we devised an algorithm that 
iteratively divides the timeline into intervals, aiming at the 
maximization of a value we will call the burst ratio. Let
us consider an ordered set $T = \{t_0, t_1, \ldots, t_n\}$ (in ascending
order), where each element is the time of an occurrence of the
n-gram. Furthermore, any two consecutive elements of $T$ must
originate from different blogs. This guarantees that a burst can
only be maintained by the participation of multiple blogs.

Figure 1. Example of a sequence of occurrences of a given n-gram. The
ordered sets $T$ and $\Theta$ are depicted. Inter-burst and intra-burst intervals are
represented by arrows (respectively straight and curved).

We are interested in partitioning $T$ into subsets which
correspond to temporal bursts. Let us consider the ordered
set $\Theta = \{\theta_0, \theta_1, \ldots, \theta_n\}$, where $\theta_k = 1$ if element $t_k$ is
the last element of a burst, and $\theta_k = 0$ otherwise. Each
time $\theta_k$ equals 1, the burst ends at $t_k$. Given
a partition of the sequence of occurrences of an n-gram into bursts, it
is straightforward to compute the time-lag between the end
of a burst and the beginning of the next burst, or the time-lag
between two occurrences inside the same burst. We can
compute the average time-lag between two consecutive bursts
and the average interval inside each burst on the whole timeline
as follows:
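$$\mu_{\rightarrow}(T, \Theta) = \frac{\sum_{i < |T|,\ \theta_i = 1} (t_{i+1} - t_i)}{|\{i < |T| : \theta_i = 1\}|}, \qquad \mu_{\circlearrowleft}(T, \Theta) = \frac{\sum_{i < |T|,\ \theta_i = 0} (t_{i+1} - t_i)}{|\{i < |T| : \theta_i = 0\}|} \quad (1)$$

where $\mu_{\rightarrow}$ is the mean inter-burst interval, $\mu_{\circlearrowleft}$ is the mean
intra-burst interval, and the index $i$ ranges over positions for which
$t_{i+1}$ exists. We also define the minimum inter-burst interval $m_{\rightarrow}(T, \Theta)$:

$$m_{\rightarrow}(T, \Theta) = \min_{i < |T|,\ \theta_i = 1} (t_{i+1} - t_i)$$

We then define the burst ratio, $\rho(T, \Theta)$, as:

$$\rho(T, \Theta) = \frac{\mu_{\rightarrow}(T, \Theta)}{\mu_{\circlearrowleft}(T, \Theta)} \quad (2)$$

Simply put, $\rho(T, \Theta)$ is the ratio of the mean time interval
between bursts to the mean time interval between elements
inside bursts.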

Algorithm 1 presents the pseudo-code that describes
the clustering method. The process is started with all the
elements of $\Theta$ initialized to 0, meaning that in the initial state,
all n-gram occurrences are considered to belong to a single
burst. The algorithm iteratively tries to add new divisions to
$\Theta$, keeping the ones that increase the burst ratio, until no
further improvement is possible.

Parameters $\alpha$ and $\beta$ determine, respectively, the minimum
burst ratio and the minimum interval between bursts (in days) that are
accepted. These parameters allow us to prevent the formation of
bursts that are not sufficiently separated, both in relation to the
average interval between n-gram occurrences and in absolute
value. For our purposes, we experimentally determined $\alpha = 5$
and $\beta = 5$ to be good values.

We devised our own burst detection algorithm instead of
using one of the available ones, due to the specific
requirements of our approach. For example, the weighted automaton
model described in [2] is very suitable for detecting bursts
at quantifiable levels of intensity, but does not lend itself
to the detection of bursts with well defined limits. For the
probabilistic model we are going to describe in the following
section, it is crucial that we consider bursts with well defined
limits, so as not to lose initial or late arrivals. Our algorithm
detects cases where the activity on a certain n-gram set can
be characterized by intervals with a sufficient level of activity,
separated by large enough intervals of no activity.

Algorithm 1 Pseudo-code of algorithm to perform temporal
clustering of n-gram occurrences into bursts.

stop ← False
while stop = False do
    best_burst_ratio ← -1
    best_pos ← -1
    for pos = 1 to |T| do
        if θ_pos = 0 then
            aux_Θ ← Θ with θ_pos set to 1
            burst_ratio ← ρ(T, aux_Θ)
            min_inter_interval ← m_→(T, aux_Θ)
            if burst_ratio < α or min_inter_interval < β then
                burst_ratio ← -1
            end if
            if burst_ratio > best_burst_ratio then
                best_burst_ratio ← burst_ratio
                best_pos ← pos
            end if
        end if
    end for
    if best_pos > 0 then
        θ_best_pos ← 1
    else
        stop ← True
    end if
end while
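A minimal Python sketch of the greedy clustering may help make the procedure concrete. It is written under several assumptions that the text does not settle: times are plain numbers expressed in days, a candidate split whose burst ratio is undefined (no intra-burst interval left) is treated as inadmissible, rejected candidates can never be selected, and the last occurrence is never a split point because no interval follows it. The function names are hypothetical, and the constraint that consecutive occurrences come from different blogs is assumed to have been enforced upstream.

```python
import math
from statistics import mean


def _gaps(times):
    """Intervals between consecutive occurrences."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def burst_ratio(times, theta):
    """Mean inter-burst gap over mean intra-burst gap, or None if undefined."""
    gaps = _gaps(times)
    inter = [g for g, flag in zip(gaps, theta) if flag == 1]
    intra = [g for g, flag in zip(gaps, theta) if flag == 0]
    if not inter or not intra:
        return None
    intra_mean = mean(intra)
    return math.inf if intra_mean == 0 else mean(inter) / intra_mean


def min_inter_interval(times, theta):
    """Smallest gap that separates two bursts, or None if there is no split."""
    inter = [g for g, flag in zip(_gaps(times), theta) if flag == 1]
    return min(inter) if inter else None


def detect_bursts(times, alpha=5.0, beta=5.0):
    """Greedily add the admissible split with the highest burst ratio."""
    theta = [0] * len(times)
    while True:
        best_ratio, best_pos = -1.0, -1
        for pos in range(len(times) - 1):
            if theta[pos] == 1:
                continue
            trial = theta.copy()
            trial[pos] = 1
            ratio = burst_ratio(times, trial)
            if ratio is None or ratio < alpha:
                continue  # burst ratio too low (or undefined)
            if min_inter_interval(times, trial) < beta:
                continue  # bursts not separated enough in absolute terms
            if ratio > best_ratio:
                best_ratio, best_pos = ratio, pos
        if best_pos < 0:
            break
        theta[best_pos] = 1
    # Turn the boundary flags into explicit lists of occurrence times.
    bursts, current = [], []
    for t, flag in zip(times, theta):
        current.append(t)
        if flag == 1:
            bursts.append(current)
            current = []
    if current:
        bursts.append(current)
    return bursts
```

For example, `detect_bursts([0, 0.5, 1, 1.5, 20, 20.5, 21])` splits the timeline at the 18.5-day gap, while an evenly spaced sequence stays a single burst because no split reaches the minimum ratio.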

Finally we filter the n-gram bursts, only accepting the ones 
that meet the following criteria: 

• A minimum number of blogs participating in the burst of 
4. 

• A minimum average time between posts participating in 
the burst of 1 hour. 

• A maximum average time between posts participating in 
the burst of 1 day. 

• A minimum burst duration of 3 days. 

• A maximum total duration of all the bursts of the n-gram 
of 1 month. 

The purpose of these rules is to end up with n-gram bursts 
that are more likely related to a real topic. We discard bursts 
that are too sparse, too dense, too short lived or defined by an 
n-gram that is too common. 
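The five filtering criteria can be expressed as a single predicate. The sketch below is illustrative only: the function name and data layout are hypothetical, "1 month" is assumed to mean 30 days, and the average time between posts is assumed to be the burst duration divided by the number of gaps.

```python
from datetime import datetime, timedelta


def keep_burst(posts, total_ngram_duration):
    """Decide whether a burst passes the filtering rules.

    posts: list of (blog_id, datetime) pairs in the burst, sorted by time.
    total_ngram_duration: summed duration of all bursts of the same n-gram.
    """
    if len(posts) < 2:
        return False
    blogs = {blog for blog, _ in posts}
    duration = posts[-1][1] - posts[0][1]
    average_gap = duration / (len(posts) - 1)
    return (
        len(blogs) >= 4                                   # enough distinct blogs
        and timedelta(hours=1) <= average_gap             # not too dense
        and average_gap <= timedelta(days=1)              # not too sparse
        and duration >= timedelta(days=3)                 # not too short lived
        and total_ngram_duration <= timedelta(days=30)    # n-gram not too common
    )
```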

C. Merging n-gram bursts into topics 

Finally, in the last phase, we extract a set of topics from
the set of n-gram bursts that resulted from the previous
step. We define a topic as a tuple $(\{g_0, g_1, \ldots, g_n\}, t, t')$,
consisting of a set of n-grams occurring between times $t$
and $t'$. Topics are defined with the minimum possible set
of n-grams for maximum generality. Figure 2 illustrates on
a real example how the n-gram bursts are selected to define
a topic. The underlying idea is the following: consider two
n-gram bursts, defined by n-grams $g_a$ and $g_b$, occurring over
time intervals $[t_a, t'_a]$ and $[t_b, t'_b]$. Furthermore, consider that
the sequence of words in n-gram $g_b$ is a sub-sequence of
the sequence of words in n-gram $g_a$, and that $t_a > t_b$ and
$t'_a < t'_b$. Referring to Figure 2, this could be exemplified
by $g_a$ = "region avoir apporter contribution debat" and
$g_b$ = "apporter contribution debat". We assume that, in this
kind of situation, it is very likely that both bursts belong to the
same topic. $g_b$ is more general than $g_a$, because it includes all
the cases covered by $g_a$, while the opposite is not necessarily
true.

Figure 2. Example of selection of n-gram bursts to define a topic. Bursts in
solid lines are selected for the topic definition, while bursts in dashed lines are
discarded.

We traverse the entire set of n-gram bursts, in descending
order of the number of words contained in their n-gram. For
each burst, we look for bursts ahead in the set with n-grams
that are a sub-sequence of the first one, and with time intervals
that contain the interval of the first one. If such bursts are
found, the original burst is discarded. If one of the bursts found
is already assigned to a topic, we also assign the other bursts
found to that topic; otherwise we assign all bursts found to a
new topic.
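The traversal just described can be sketched as follows. Several points are assumptions rather than statements from the text: "sub-sequence" is read as a contiguous run of words, interval containment is tested non-strictly, and a burst that is neither discarded nor attached to any topic is assumed to form a topic of its own. The names are hypothetical.

```python
def is_contiguous_subsequence(short, long):
    """True if the words of `short` appear as one contiguous run in `long`."""
    n = len(short)
    return any(tuple(long[i:i + n]) == tuple(short)
               for i in range(len(long) - n + 1))


def merge_bursts(bursts):
    """Group bursts into topics.

    bursts: list of (words, start, end), with `words` a tuple of lemmas.
    Returns a list of topics, each a sorted list of indices into `bursts`.
    """
    # Longest n-grams first; ties keep their original order.
    order = sorted(range(len(bursts)), key=lambda i: -len(bursts[i][0]))
    topic_of, discarded, next_topic = {}, set(), 0
    for rank, i in enumerate(order):
        words, start, end = bursts[i]
        found = [
            j for j in order[rank + 1:]
            if len(bursts[j][0]) < len(words)
            and is_contiguous_subsequence(bursts[j][0], words)
            and bursts[j][1] <= start and end <= bursts[j][2]
        ]
        if not found:
            continue
        discarded.add(i)  # a more general burst covers this one
        existing = [topic_of[j] for j in found if j in topic_of]
        if existing:
            topic = existing[0]
        else:
            topic, next_topic = next_topic, next_topic + 1
        for j in found:
            topic_of.setdefault(j, topic)
    topics = {}
    for j, topic in topic_of.items():
        if j not in discarded:
            topics.setdefault(topic, []).append(j)
    # Assumption: leftover bursts each become a singleton topic.
    for i in range(len(bursts)):
        if i not in discarded and i not in topic_of:
            topics[next_topic] = [i]
            next_topic += 1
    return [sorted(members) for _, members in sorted(topics.items())]
```

With the example from the text, the five-word burst is discarded in favour of the more general three-word burst that contains its interval, which then defines the topic.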



IV. Probabilistic Precedence Scoring 

After the process described in the previous section, we now 
have a set of topics, and know which blogs participated in 
each topic and at what time. We are now in the position of 
defining a probabilistic model that estimates the tendency that 
blogs have to participate in topics before other blogs. 

We will start by defining a dyadic precursor score from
blog $b$ to blog $b'$. We will call this score $\gamma(b, b')$. Let us
define $A$ as the set of all topics where both blogs participate,
and $Y$ as the subset of $A$ where the first participation of $b$
precedes the first participation of $b'$. We also define $C$ as a
vector of probabilities. Each element of $C$ is the probability
that $b$ participates in a topic before $b'$ by chance. We will
detail later how these probabilities are computed. We now
define the likelihood of $\gamma(b, b') = p$, given $A$, $Y$ and $C$:

$$\Lambda(\gamma(b, b') = p \mid A, Y, C) = \sum_{\substack{Z \cup R = Y \\ Z \cap R = \emptyset}} \Lambda(\gamma(b, b') = p \mid A, Y, C, Z, R) \quad (3)$$

The likelihood in equation 3 is defined as the sum of the
likelihoods for all possible hypotheses of the appearances of $b$
before $b'$ being caused by a temporal relationship or by chance.
The set $Y$ of topics where the first participation of $b$ precedes
the first participation of $b'$ can be decomposed as the union of
the set $Z$ of topics where $b$ is assumed to display a behavior
of precedence over $b'$, and the set $R$ of topics where $b$ is
assumed to precede $b'$ by chance. We define the likelihood of
each hypothesis as:

$$\Lambda(\gamma(b, b') = p \mid A, Y, C, Z, R) = P_Z(A, Z, p) \cdot P_R(A, R, C) \quad (4)$$

$P_Z(A, Z, p)$ is the probability that $b$ precedes $b'$ in the topics
in $Z$ and not in the topics in $A \setminus Z$, given a probability of a
precedence relationship of $b$ over $b'$ of $p$. $P_R(A, R, C)$ is the
probability that $b$ precedes $b'$ by chance for the topics in $R$,
and not for the topics in $A \setminus R$, given $C$. These probabilities
are defined as:

$$P_Z(A, Z, p) = p^{|Z|} (1 - p)^{|A \setminus Z|}$$

$$P_R(A, R, C) = \prod_{r \in R} C_r \prod_{r \in A \setminus R} (1 - C_r) \quad (5)$$
Now we have to define how to compute the probabilities $C_r$
that topic $r$ is mentioned by $b$ before $b'$. We compute these
probabilities by taking into account the total number of posts
published by each blog during the time interval of the topic,
in the following way:

$$C_r = \frac{N_p(b, [t_s(r); t_e(r)])}{N_p(b, [t_s(r); t_e(r)]) + N_p(b', [t_s(r); t_e(r)])} \quad (7)$$

$t_s(r)$ is the time of the beginning of topic $r$ and $t_e(r)$ is the
time of its end. $N_p(j, [t; t'])$ gives the number of posts published
by blog $j$ between times $t$ and $t'$. Simply, this expression
reflects the idea that the higher the number of posts of blog
$b$ as compared to the total number of posts from both blogs
in the time interval, the more likely $b$ is to publish the first
post on the topic by chance. We do not consider the overall
posting rates of the blogs, as these change over time.
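As a small illustration of equation 7, the chance-precedence probability for one topic can be computed from the two blogs' post times. The function name and the representation of times as plain numbers are assumptions made for the example; both blogs are assumed to have at least one post in the topic interval, which holds whenever they both participate in the topic.

```python
def chance_precedence(posts_b, posts_b_prime, t_start, t_end):
    """Share of the posts published in [t_start, t_end] that belong to b."""
    n_b = sum(t_start <= t <= t_end for t in posts_b)
    n_b_prime = sum(t_start <= t <= t_end for t in posts_b_prime)
    return n_b / (n_b + n_b_prime)
```

A blog that published three of the four posts appearing during the topic interval would thus be given a 0.75 probability of being first purely by chance.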

The computation of the likelihood expressed in equation 5 suffers
from combinatorial explosion. In fact, the number of
computations that have to be performed to calculate
$\Lambda(\gamma(b, b') = p \mid A, Y, C, Z, R)$ scales exponentially with $|Y|$. For this reason,
when $|Y|$ is above 15, we resort to an estimation based on
sampling.

Finally, we estimate $\gamma(b, b')$ by calculating the mean of the
possible values it can take ($\gamma(b, b') \in [0, 1]$), weighted by
their likelihood:

$$\gamma(b, b') = \frac{\int_0^1 \Lambda(\gamma(b, b') = p \mid A, Y, C) \cdot p \, dp}{\int_0^1 \Lambda(\gamma(b, b') = p \mid A, Y, C) \, dp} \quad (8)$$

Not having an analytical solution for equation 8, we use
Monte Carlo integration.
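To make the estimation concrete, here is a small Python sketch of the likelihood sum and of the likelihood-weighted mean. It departs from the procedure described above in two labelled ways: the integral is approximated with a deterministic midpoint rule instead of Monte Carlo integration, and the sum over hypotheses is enumerated exhaustively, which is only practical for small $|Y|$. Topics are assumed to be indexed from 0 to `n_topics - 1`, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def likelihood(p, n_topics, Y, C):
    """Sum, over every split of Y into Z and R, of P_Z * P_R."""
    Y = sorted(Y)
    total = 0.0
    for size in range(len(Y) + 1):
        for Z in combinations(Y, size):
            R = set(Y) - set(Z)
            # b truly precedes b' in Z and in no other topic of A.
            p_z = p ** len(Z) * (1 - p) ** (n_topics - len(Z))
            # b precedes b' by chance in R and in no other topic of A.
            p_r = prod(C[r] if r in R else 1 - C[r] for r in range(n_topics))
            total += p_z * p_r
    return total


def dyadic_precursor_score(n_topics, Y, C, steps=1000):
    """Likelihood-weighted mean of p over [0, 1], using the midpoint rule."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    weights = [likelihood(p, n_topics, Y, C) for p in grid]
    return sum(w * p for w, p in zip(weights, grid)) / sum(weights)
```

For instance, with two shared topics and no observed precedence (`Y` empty), the score is about 0.25, and it rises above 0.5 when `b` precedes `b'` in every shared topic while its chance probabilities `C` are low.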

Having a way to compute dyadic precursor scores, we are 
now interested in scoring the blogs according to their overall 
precursor/laggard behaviors over the entire network. For this 
purpose, we will define two metrics: the global precursor score 
(P) and the laggard score (L). 

A dyadic precursor score $\gamma(b, b')$ can be interpreted as the
probability that a post from blog $b'$ participates in a topic
under a temporal relationship with blog $b$, where $b$ precedes
$b'$, given that both blogs are known to participate in that topic.
We can remove the topic co-participation assumption using
Bayes' theorem. Considering $M$ to be the event of the post
participating in the topic under the temporal relationship, and
$H$ to be the event of the post from blog $b'$ participating in a
topic where blog $b$ also participates:

$$\gamma(b, b') = P_r(M \mid H) \quad (9)$$

$$P_r(M \mid H) = \frac{P_r(H \mid M) P_r(M)}{P_r(H)} \quad (10)$$

$$\omega(b, b') = P_r(M) = P_r(M \mid H) P_r(H) = \gamma(b, b') P_r(H) \quad (11)$$

We will call $\omega(b, b')$ the adjusted dyadic precursor score.
Notice that $P_r(H \mid M) = 1$, because if the post participates in
a topic under a temporal relationship with the other blog, the
blogs will necessarily co-participate in that topic.

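We define the global precursor score for a blog $b$, $P(b)$,
as the mean of all adjusted dyadic precursor scores where $b$
is the origin, and the laggard score, $L(b)$, as the mean of all
adjusted dyadic precursor scores where $b$ is the target. With
$B$ the set of all blogs in the network:

$$P(b) = \frac{1}{|B| - 1} \sum_{b' \in B \setminus \{b\}} \omega(b, b') \quad (12)$$

$$L(b) = \frac{1}{|B| - 1} \sum_{b' \in B \setminus \{b\}} \omega(b', b) \quad (13)$$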
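The aggregation from dyadic to global scores can be sketched as below. This is an illustration under assumptions: the dyadic scores and the co-participation probabilities are supplied as nested dictionaries (how the co-participation probability is estimated is not specified here), the mean is taken over the other blogs in the network, and the function name is hypothetical.

```python
def global_scores(gamma, pr_h):
    """Return (P, L): global precursor and laggard scores per blog.

    gamma[b][b2]: dyadic precursor score from b to b2.
    pr_h[b][b2]:  probability that a post of b2 is in a topic shared with b.
    """
    blogs = list(gamma)
    # Adjusted dyadic precursor score: omega = gamma * Pr(H).
    omega = {
        b: {b2: gamma[b][b2] * pr_h[b][b2] for b2 in blogs if b2 != b}
        for b in blogs
    }
    others = len(blogs) - 1
    precursor = {b: sum(omega[b].values()) / others for b in blogs}
    laggard = {
        b: sum(omega[b2][b] for b2 in blogs if b2 != b) / others
        for b in blogs
    }
    return precursor, laggard
```

In a two-blog network, one blog's precursor score is by construction the other's laggard score.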
V. Results and Discussion


The protocol described in the previous sections was applied
to a dataset generated from a crawl of the French political
blogosphere, consisting of 916 blogs, between
October 1st 2009 and February 11th 2010. During this period,
40,191 posts were published, containing 16,909 citation links
to other blogs in the network. We applied our topic detection
process on this data and identified 2,619 different topics.

We then computed the global precursor and laggard scores
according to the process described in the previous section for
each blog that published at least 7 posts during the whole
observation period. We discarded nearly 300 blogs with very
low posting rates because of the noise they may introduce into
the computation of the global scores.

Table I. Significance of mean in-degree relationships for classes of
blogs determined according to precursor and laggard score
intervals.

Figure 3. Scatter plot of precursor (P) vs. laggard (L) scores for all blogs
in the network. Horizontal axis: precursor score (P).

Figure 3 shows a scatter plot of the blogs, positioned in the
plane according to their global precursor and laggard scores: 
P and L. This plot gives us an overview of the structure of 
the network in terms of precursor/laggard behaviors. It can be 
observed that there is a dense cluster of blogs near the origin, 
with the distribution of blogs rarefying in both the x and y 
directions. 

A blog may be situated in the low scores cluster for different 
reasons. It could be that it does not tend to participate in 
popular topics (which also means that the topics it discusses 
are not spread through the network), or it could be that it 
maintains relationships of influence with other blogs which are 
close to being symmetrical. This type of relationship between 
two blogs makes it approximately equally likely that each blog 
influences the other to enter a topic. Our scores are not capable 
of distinguishing a symmetrical influence relationship from an 
indirect relationship.¹

In the study of blog networks, it is common to establish 
popularity metrics based on the URL links that point to a blog. 
We compute the in-degree of a blog as the number of blogs 
that link to it at least once during the observation period, as 
well as the classical page rank. Our goal is to compare those 
metrics based on the topology of the hyperlinks network with 
our temporal semantic based scores. 

Figure 4 shows box plots of in-linking and page rank per
interval of precursor score. The two plots present similar 
shapes, showing an increase in both in-link degrees and page 
ranks up to the third bar. On the fourth bar there is a clear 
decrease, suggesting that the precursor behavior is positively 
correlated with blog popularity only up to a certain point. 

In Figure 5 we plot in-linking per interval of laggard score.
This plot is noisier and the pattern is less clear than
the previous one. Higher laggard scores appear to have a
detrimental effect on link popularity. Although not shown,
a similar pattern was found when comparing page ranks to
laggard scores.

Figure 4. Above: box plots of in-linking distributions for intervals of
precursor scores. Below: box plots of page rank distributions for intervals
of precursor scores. Horizontal axis: precursor score (P), in intervals
[0.0000, 0.0018[, [0.0018, 0.0036[, [0.0036, 0.0053[ and [0.0053, 1].

Figure 5. Box plots of in-linking distributions for intervals of laggard scores.

¹ Since the blog network is not a closed system, two blogs could have a
very similar set of external influences, leading to the same temporal patterns
they would display if influencing each other in a symmetrical way.

In order to derive general principles, we divided the blog
set into four classes. Each class is characterized by a high
or low precursor score and a high or low laggard score. A
precursor score is considered low if it is less than or equal to
the mean precursor score for the entire set ($P \in [0, \bar{P}]$), and
high otherwise ($P \in \,]\bar{P}, 1]$). Laggard scores are classified in
an analogous fashion. We use the notation p for low precursor,
P for high precursor and so on. The class Pl, for example, is
the one containing blogs with a high precursor score and a low
laggard score.
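The four-way classification can be written in a few lines. This is a sketch with a hypothetical function name; it assumes each blog's scores are given as a (precursor, laggard) pair and treats a score equal to the mean as low, as stated above.

```python
from statistics import mean


def classify_blogs(scores):
    """Map each blog to one of the labels 'pl', 'pL', 'Pl', 'PL'.

    scores: dict mapping blog -> (precursor_score, laggard_score).
    """
    precursor_mean = mean(p for p, _ in scores.values())
    laggard_mean = mean(l for _, l in scores.values())
    return {
        blog: ("P" if p > precursor_mean else "p")
        + ("L" if l > laggard_mean else "l")
        for blog, (p, l) in scores.items()
    }
```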

In each cell of Table I we perform a comparison between the
mean in-link degree of each class. The statistical significance
of the differences was determined using Wilcoxon rank sum
tests. We use a number of * symbols to denote the level of
significance found: one * if p-value < 0.05, two if
p-value < 0.01 and three if p-value < 0.001. The mean
in-degrees for classes are shown in row and column headers.

When comparing the two classes with low laggard scores, 
the one with a high precursor score has a higher mean
in-degree. The same is true of the two classes with high 
laggard scores. When comparing the two classes with a low 
precursor score, the one with the low laggard score has the 
higher mean in-degree. In the two cases where no significance 
was found, the p-value was very close to 0.05, suggesting 
that the relationships are likely true, but we have insufficient 
data to be certain. This confirms that higher precursor scores 
and lower laggard scores have a positive effect on in-linking. 
These results also show that the two scores are not just 
reflecting the effect of participating in discussions. In fact, 
both scores require higher participation for higher values, but 
have opposite effects. 

It is clear, however, that these general principles do not tell 
the whole story. The box plots show that, despite the general 
principles, blogs with high precursor scores are not necessarily 
rewarded with high in-link degrees. 

This becomes more obvious by observing the hexagonal
binning plot shown in Figure 6. It displays the mean in-linking
per region of precursor and laggard scores. The darker the 
color, the higher the in-linking mean. It clearly confirms for 
example that higher precursor score does not guarantee higher 
in-degree. 

Figure 6. Hexagonal binning plot displaying mean in-linking per region of
precursor and laggard scores. The darker the color, the higher the in-linking.

To validate our protocol and experimental results, we
generated four lists of ten blogs. We determined the position of
each blog on a plane, where dimension x is the precursor
score, and y the link in-degree. Both axes were converted to a
logarithmic scale and normalized to [0, 1] intervals. From this
spatial distribution, list 1 contains the blogs closest to point
(0, 0) - low precursors, low in-degree; list 2 the blogs closest to
(0, 1) - low precursors, high in-degree; list 3 the blogs closest
to (1, 0) - high precursors, low in-degree; and list 4 the blogs
closest to (1, 1) - high precursors, high in-degree.
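The construction of the four lists can be sketched as follows. Some details are assumptions rather than statements from the text: a small offset `eps` is added before taking logarithms so that zero values are allowed, closeness is measured with the Euclidean distance, and the function names are hypothetical. The sketch also assumes that each axis takes at least two distinct values.

```python
import math


def normalised_log(values, eps=1e-6):
    """Log-transform the values and rescale them to the [0, 1] interval."""
    logs = [math.log(v + eps) for v in values]
    low, high = min(logs), max(logs)
    return [(x - low) / (high - low) for x in logs]


def corner_lists(blogs, precursor, in_degree, k=10):
    """Return, for each of the four lists, the k blogs nearest its corner."""
    xs = normalised_log(precursor)
    ys = normalised_log(in_degree)
    corners = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
    lists = {}
    for label, (cx, cy) in corners.items():
        ranked = sorted(
            range(len(blogs)),
            key=lambda i: math.hypot(xs[i] - cx, ys[i] - cy),
        )
        lists[label] = [blogs[i] for i in ranked[:k]]
    return lists
```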

We then provided these four lists to an expert on the French 
blogosphere. She had no prior knowledge of our classification 
process. We simply asked her if she could notice any signifi- 
cant pattern inside groups. She described blogs of list 1, which 
belong to the category of low precursor and low in-degree, 
as very "small" blogs essentially concerned with regional or 
local issues. According to her, list 2 (low precursors, high in- 
degree) is typically composed of experienced bloggers who 
emerged during the last presidential election in 2007 and now
gather together despite their political differences. As such their 
pattern of linking is similar to a "rich-club" which may explain 
their high in-degree in spite of their low precursor score. Blogs 
which have high precursor score and low in-degree (list 3) 
are exclusively made of copycats. These sites are basically 
systematically relaying the media or making reviews of regular 
papers on the web. The presence of such behavior in the 
dataset incidentally explains the sharp decline of mean in- 
degree and page rank among blogs with highest precursor 
scores that we observed previously (Fig. 4). The fourth list
is composed of high precursors and high in-degree blogs. All 
of them have been described by the expert as very active in 
political contestation, both from the left and the extreme right, 
against the government policy and, more broadly, against the 
current political balance. 

VI. Conclusions 

In this work, we strived to extract quantifiable metrics from
the wealth of semantic information contained in blogs. We
presented a method for the detection of bursts of activity at the
semantic level, which was tested on a real data set and shown
capable of identifying topics characterized by n-grams and time
intervals. We then described a probabilistic model to quantify
temporal relationships between blogs. Dyadic precursor scores
are able to quantify temporal relationships between pairs of
blogs, where one tends to enter a topic before the other,
discounting the effects of asymmetrical posting rates. From
these dyadic scores we derived two scores to classify blogs
according to their overall precursor and laggard behaviors.

The comparison of these semantic temporal metrics with 
the more traditional in-link degree based popularity metrics 
revealed non-trivial relationships between the two. The expert 
assessment indicates that the scores we proposed lead to 
relevant distinctions that could not be derived from classical 
structural based methods only. Search engine ranking
algorithms, like the well-known PageRank [24] used by Google,
are more sophisticated than simple reliance on URL link
in-degrees. However, they are still based on structural aspects of
the web, deriving their estimations from the analysis of the
network of URL links. We found that the precursor/laggard
scores are able to identify blogs that have a high tendency to
be precursors in topics under discussion, but that would likely
not be distinguishable from other blogs with similar page ranks
or in-degrees by relying only on this latter type of metric. It
is conceivable that search engine ranking algorithms could be 
improved with the approach we propose. Including precursor 
scores in ranking metrics could help improve the quality of 
searches, for example the ones related to time sensitive events. 
It could also reward blogs that generate influential content, but 
that are not especially popular in the sense of receiving many 
in-links. 